Transcription factors or epigenetic marks may act on specific sets of genes grouped by a common biological feature (shared Biological function, common regulation in RNAseq experiment etc).
A frequent step in ChIPseq analysis is to test whether common gene sets are enriched for transcription factor binding or epigenetic marks. We performed similar analysis in our RNAseq workflow looking for enrichment of gene sets in differentially changing genes.
Sources of well curated genesets include GO consortium (gene’s function, biological process and cellular localisation), REACTOME (Biological Pathways) and MsigDB (Computationally and Experimentally derived).
The Gene Ontology consortium aims to provide a comprehensive resource of the currently available knowledge regarding the functions of genes and gene products.
Functional categories of genes are broadly split into three main groups.
The three sub categories of gene ontology are arranged in nested, structured graph with gene sets at the top of graph representing more general terms and those at the bottom more specific terms.
The Reactome and KEGG (Kyoto Encyclopedia of genes and genomes) contains information on genes’ membership and roles in molecular pathways.
These databases focus largely on metabolic and disease pathways, and allow us to investigate our genes in the context of not only functional roles but relative positions within pathways.
The molecular signatures database (MsigDB) is available from the Broad institute and provides a set of curated gene sets derived from sources such as GO, pathway databases, motif scans and even other experimental sets.
MsigDB databases are widely used in gene set enrichments analysis and are available as plain text in formats used with the popular Java based gene set enrichment software GSEA. ]
]
In R we can access information on these gene sets through database libraries (such as the Org.db.eg we have reviewed) such as GO.db, KEGG.db, reactome.db or by making use of libraries which allows us to our gene sets from parse plain text formats, GSEABase.
Geneset enrichment testing may be performed on the sets of genes with peaks associated to them. In this example we will consider genes with peaks within 1000bp of a gene’s TSS.
We will not access these database libraries directly in testing but will use other R/Bioconductor libraries which make extensive use of them.
To perform geneset testing here, we will use the GOseq package.
We must provide a named numeric vector of 1s or 0s to illustrate whether a gene had any peaks overlapping it’s TSS.
In this example we use all TSS sites we found to be overlapped by Myc peaks.
The peaks landing in TSS regions will be marked as “Promoter” in the annotation column of our annotated GRanges object.
## GRanges object with 1 range and 14 metadata columns:
## seqnames ranges strand | abs_summit fold_enrichment annotation
## <Rle> <IRanges> <Rle> | <integer> <numeric> <character>
## [1] chr1 4785370-4785641 * | 4785565 5.23296 Promoter
## geneChr geneStart geneEnd geneLength geneStrand geneId
## <integer> <integer> <integer> <integer> <integer> <character>
## [1] 1 4783572 4785692 2121 2 27395
## transcriptId distanceToTSS ENSEMBL SYMBOL
## <character> <numeric> <character> <character>
## [1] ENSMUST00000132625.1 51 ENSMUSG00000033845 Mrpl15
## GENENAME
## <character>
## [1] mitochondrial ribosomal protein L15
## -------
## seqinfo: 21 sequences from mm10 genome
We can extract the unique names of genes with peaks in their TSS by subsetting the annotated GRanges and retrieving gene names from the geneId column.
annotatedPeaksGR_TSS <- annotatedPeaksGR[annotatedPeaksGR$annotation == "Promoter",
]
genesWithPeakInTSS <- unique(annotatedPeaksGR_TSS$geneId)
genesWithPeakInTSS[1:2]## [1] "27395" "108664"
Next we can extract all genes which are included in the TxDb object to use as our universe of genes for pathway enrichment.
## 66 genes were dropped because they have exons located on both strands
## of the same reference sequence or on more than one reference sequence,
## so cannot be represented by a single genomic range.
## Use 'single.strand.genes.only=FALSE' to get all the genes in a
## GRangesList object, or use suppressMessages() to suppress this message.
## GRanges object with 3 ranges and 1 metadata column:
## seqnames ranges strand | gene_id
## <Rle> <IRanges> <Rle> | <character>
## 100009600 chr9 21062393-21073096 - | 100009600
## 100009609 chr7 84935565-84964115 - | 100009609
## 100009614 chr10 77711457-77712009 + | 100009614
## -------
## seqinfo: 66 sequences (1 circular) from mm10 genome
Once we have a vector of all genes we can create a named vector of 1s or 0s representing whether a gene had peak in TSS or not. We can turn a logical vector into 1 for TRUE and 0 for FALSE simply using the as.integer() function.
allGenesForGOseq <- as.integer(allGeneIDs %in% genesWithPeakInTSS)
names(allGenesForGOseq) <- allGeneIDs
allGenesForGOseq[1:3]## 100009600 100009609 100009614
## 1 0 0
Now we have the the input for GOseq we can test against KEGG (or GO if we choose) using a standard hypergeometric test.
First we must construct a nullp data.frame for use within goseq using the nullp() function and supplying our named vector, genome to be used and gene identifier used.
The nullp() function attempts to correct for gene length biases we may see in geneset testing. i.e. a longer gene may have more chance to have a peak within it.
## Can't find mm10/knownGene length data in genLenDataBase...
## Found the annotation package, TxDb.Mmusculus.UCSC.mm10.knownGene
## Trying to get the gene lengths from it.
We can see which genomes are supported using the supportedGenomes() function.
## db species date name
## 2 hg19 Human Feb. 2009 Genome Reference Consortium GRCh37
## 3 hg18 Human Mar. 2006 NCBI Build 36.1
## AvailableGeneIDs
## 2 ccdsGene,ensGene,exoniphy,geneSymbol,knownGene,nscanGene,refGene,xenoRefGene
## 3 acembly,acescan,ccdsGene,ensGene,exoniphy,geneid,geneSymbol,genscan,knownGene,knownGeneOld3,refGene,sgpGene,sibGene,xenoRefGene
Now we have created our nullp data.frame, we can use this within the goseq() function to test for the enrichment of genesets.
We also select a database of genesets to test, here “KEGG”. We can choose additionally from “GO:CC”, “GO:BP”, “GO:MF” genesets.
As we are testing peaks in 1000bp windows around genes’s TSSs we can simply use the “Hypergeometric” for the method parameter.
Kegg_MycPeaks <- goseq(pwf, "mm10", "knownGene", test.cats = c("KEGG"), method = "Hypergeometric")
Kegg_MycPeaks[1:2, ]## category over_represented_pvalue under_represented_pvalue numDEInCat
## 89 03010 8.744084e-33 1 76
## 90 03013 5.729961e-27 1 105
## numInCat
## 89 88
## 90 158
The dataframe of enrichment results contains KEGG IDs but no description of pathways and processes.
We can use the KEGG.db package to extract a mapping of IDs to descriptive statistics.
library(KEGG.db)
xx <- as.list(KEGGPATHID2NAME)
idtoName <- cbind(names(xx), unlist(xx))
idtoName[1:5, ]## [,1] [,2]
## 00010 "00010" "Glycolysis / Gluconeogenesis"
## 00020 "00020" "Citrate cycle (TCA cycle)"
## 00030 "00030" "Pentose phosphate pathway"
## 00040 "00040" "Pentose and glucuronate interconversions"
## 00051 "00051" "Fructose and mannose metabolism"
We can now merge our data.frame of ID to pathway name mappings to our data.frame of enrichment results.
KeggN_MycPeaks <- merge(idtoName, Kegg_MycPeaks, by = 1, all = TRUE)
orderByP <- order(KeggN_MycPeaks$over_represented_pvalue)
KeggN_MycPeaks <- KeggN_MycPeaks[orderByP, ]
KeggN_MycPeaks[1:5, ]## V1 V2 over_represented_pvalue
## 187 03010 Ribosome 8.744084e-33
## 188 03013 RNA transport 5.729961e-27
## 194 03040 Spliceosome 3.424264e-22
## 186 03008 Ribosome biogenesis in eukaryotes 2.183456e-16
## 145 00970 Aminoacyl-tRNA biosynthesis 1.458264e-11
## under_represented_pvalue numDEInCat numInCat
## 187 1 76 88
## 188 1 105 158
## 194 1 83 123
## 186 1 54 76
## 145 1 32 42
In addition to a standard enrichment tests, methods have been implemented specifically for ChIPseq. Many of these tools aim to incorporate peaks distal to genes into their enrichment testing such as the popular GREAT toolset.
Incorporating distal peaks by rules such as nearest gene results in some genes having a higher chance of being selected and hence some genesets as a whole having a higher chance of having its members selected.
GREAT defines regulatory regions for each, individual gene and compares the proportion of peaks mapping to a geneset’s regulatory regions to the proportion of the genome occupied by geneset’s regulatory regions.
i.e. If a gene set’s regulatory regions account for 1% of the genome then one might expect 1% of peaks to overlap these regions by chance.
We can use the GREAT Bioconductor interface available in the rGREAT package.
To submit jobs we can use our GRanges of Myc peaks and specify a genome with the submitGreatJob function.
This function returns a GreatJob object containing a reference to our results on the GREAT server. To review the categories of results available we can use the availableCategories function on our GreatJob object.
great_Job <- submitGreatJob(macsPeaks_GR, species = "mm10", version = "3.0.0", request_interval = 1)
availableCategories(great_Job)## [1] "GO" "Phenotype Data and Human Disease"
## [3] "Pathway Data" "Gene Expression"
## [5] "Regulatory Motifs" "Gene Families"
The results table can be retrieved using the getEnrichmentTables function and specifying which tables we wish to review.
Here we retrieve the results tables for the “Regulatory Motifs” genesets which contains 2 seperate database results.
## The default enrichment tables contain no associated genes for the input
## regions. You can set `download_by = 'tsv'` to download the complete
## table, but note only the top 500 regions can be retreived. See the
## following link:
##
## https://great-help.atlassian.net/wiki/spaces/GREAT/pages/655401/Export#Export-GlobalExport
## [1] "MSigDB Predicted Promoter Motifs" "MSigDB miRNA Motifs"
Now we can review the enrichment of our genes with Myc peaks in their TSS for the “MSigDB Predicted Promoter Motifs” genesets.
## ID
## 1 SCGGAAGY_V$ELK1_02
## 2 GGGCGGR_V$SP1_Q6
## 3 CACGTG_V$MYC_Q2
## 4 RCGCANGCGY_V$NRF1_Q6
## name
## 1 Motif SCGGAAGY matches ELK1: ELK1, member of ETS oncogene family
## 2 Motif GGGCGGR matches SP1: Sp1 transcription factor
## 3 Motif CACGTG matches MYC: v-myc myelocytomatosis viral oncogene homolog (avian)
## 4 Motif RCGCANGCGY matches NRF1: nuclear respiratory factor 1
## Binom_Genome_Fraction Binom_Expected Binom_Observed_Region_Hits
## 1 0.05912876 814.6169 1586
## 2 0.20132920 2773.7130 3782
## 3 0.07555623 1040.9380 1638
## 4 0.05163166 711.3293 1195
## Binom_Fold_Enrichment Binom_Region_Set_Coverage Binom_Raw_PValue
## 1 1.946928 0.11511940 1.738355e-136
## 2 1.363515 0.27451550 1.823467e-94
## 3 1.573580 0.11889380 1.110696e-71
## 4 1.679953 0.08673877 2.182812e-65
## Binom_Adjp_BH Hyper_Total_Genes Hyper_Expected Hyper_Observed_Gene_Hits
## 1 1.069088e-133 1069 502.9698 806
## 2 5.607161e-92 2689 1265.1880 1812
## 3 2.276927e-69 941 442.7452 695
## 4 3.356073e-63 822 386.7551 606
## Hyper_Fold_Enrichment Hyper_Gene_Set_Coverage Hyper_Term_Gene_Coverage
## 1 1.602482 0.07821446 0.7539757
## 2 1.432198 0.17583700 0.6738565
## 3 1.569752 0.06744299 0.7385760
## 4 1.566883 0.05880640 0.7372263
## Hyper_Raw_PValue Hyper_Adjp_BH
## 1 1.739538e-83 5.349079e-81
## 2 5.523247e-114 3.396797e-111
## 3 1.936613e-65 3.970057e-63
## 4 1.614352e-56 2.482066e-54
A common practice in transcription factor ChIPseq is to investigate the motifs enriched under peaks.
Denovo motif enrichment can be performed in R/Bioconductor but this can be very time consuming. Here we will use the Meme-ChIP suite available online to identify denovo motifs.
Meme-ChIP requires a FASTA file of sequences under peaks as input so we extract this using the BSgenome package.
First we need to load the BSgenome object for the genome we are working on, UCSC’s mm10 build for the mouse genome, BSgenome.Mmusculus.UCSC.mm10.
## Mouse genome:
## # organism: Mus musculus (Mouse)
## # provider: UCSC
## # provider version: mm10
## # release date: Dec. 2011
## # release name: Genome Reference Consortium GRCm38
## # 66 sequences:
## # chr1 chr2 chr3
## # chr4 chr5 chr6
## # chr7 chr8 chr9
## # chr10 chr11 chr12
## # chr13 chr14 chr15
## # ... ... ...
## # chrUn_GL456372 chrUn_GL456378 chrUn_GL456379
## # chrUn_GL456381 chrUn_GL456382 chrUn_GL456383
## # chrUn_GL456385 chrUn_GL456387 chrUn_GL456389
## # chrUn_GL456390 chrUn_GL456392 chrUn_GL456393
## # chrUn_GL456394 chrUn_GL456396 chrUn_JH584304
## # (use 'seqnames()' to see all the sequence names, use the '$' or '[[' operator
## # to access a given sequence)
The motif for the ChIP-ed transcription factor should in the centre of a peak. Meme-ChIP will trim our peaks to a common length internally if sequences are of different length.
It is best therefore to provide a peak set resized to a common length.
We now have a GRanges, centred on the summit, highest point of signal for every peak.
## GRanges object with 13777 ranges and 1 metadata column:
## seqnames ranges strand | score
## <Rle> <IRanges> <Rle> | <numeric>
## [1] chr1 4785515-4785614 * | 5.23296
## [2] chr1 5083072-5083171 * | 4.26417
## [3] chr1 7397785-7397884 * | 9.84580
## [4] chr1 9545350-9545449 * | 11.85079
## [5] chr1 9700928-9701027 * | 5.14291
## ... ... ... ... . ...
## [13773] chrX 168674050-168674149 * | 5.21700
## [13774] chrX 169047705-169047804 * | 5.33257
## [13775] chrX 169320320-169320419 * | 4.67030
## [13776] chrY 1245745-1245844 * | 4.04550
## [13777] chrY 1286577-1286676 * | 5.00143
## -------
## seqinfo: 21 sequences from an unspecified genome; no seqlengths
Once we have recentered our peaks we can use the getSeq function with our GRanges of resized common peaks and the BSgenome object for mm10.
The getSeq function returns a DNAStringSet object containing sequences under peaks.
peaksSequences <- getSeq(BSgenome.Mmusculus.UCSC.mm10, macsSummits_GR)
names(peaksSequences) <- paste0(seqnames(macsSummits_GR), ":", start(macsSummits_GR),
"-", end(macsSummits_GR))
peaksSequences[1:2, ]## DNAStringSet object of length 2:
## width seq names
## [1] 100 CGTCCATCGCCAGAGTGACGCGG...CTGGGCTTCAGGTTGGCCAGGCT chr1:4785515-4785614
## [2] 100 CCGCCCTCAGCTGCGGTCACGTG...TGAGTGAGGCTCGCAACGTCTCC chr1:5083072-5083171
The writeXStringSet function allows the user to write DNA/RNA/AA(aminoacid)StringSet objects out to file.
By default the writeXStringSet function writes the sequence information in FASTA format (as required for Meme-ChIP).
Now the file “mycMel_rep1.fa” contains sequences around the geometric centre of peaks suitable for Motif analysis in Meme-ChIP.
In your own work you will typically run this from your own laptop with Meme installed locally but today we will upload our generated FASTA file to their web portal.
Results files from Meme-ChIP can be found here
We can retrieve the locations of Myc motifs identified in MEME-ChIP from the FIMO output.
Fimo reports Myc motif locations as a GFF3 file which we should be able to visualise in IGV. Sadly, this GFF file’s naming conventions cause only a fraction of motifs to be reported.
Fortunately we can parse our motif’s GFF file into R and address this using the import() function in the rtracklayer package.
We can give the sequences some more sensible names and export the GFF to file to visualise in IGV.
motifGFF$Name <- paste0(seqnames(motifGFF), ":", start(motifGFF), "-", end(motifGFF))
motifGFF$ID <- paste0(seqnames(motifGFF), ":", start(motifGFF), "-", end(motifGFF))
export.gff3(motifGFF, con = "~/Downloads/fimoUpdated.gff")Answers can be found here